GFF (General Feature Format) specifications document

2000-9-29 The default version for GFF files is now Version 2. This document has been changed to show version 2 as default, with version one alternatives shown where appropriate.

The main change from Version 1 to Version 2 is the requirement for a tag-value type structure (essentially semicolon-separated .ace format) for any additional material on the line, following the mandatory fields. Version 2 also allows '.' as a score, for features for which there is no score. Dumping in version 2 format is implemented in ACEDB.

Introduction

Essentially all current approaches to feature finding in higher organisms use a variety of recognition methods that give scores to likely signals (starts, splice sites, stops, motifs, etc.) or to extended regions (exons, introns, protein domains etc.), and then combine these to give complete gene, RNA transcript or protein structures. Normally the combination step is done in the same program as the feature detection, often using dynamic programming methods. To enable these processes to be decoupled, a format called GFF ('Gene-Finding Format' or 'General Feature Format') was proposed as a protocol for the transfer of feature information. It is now possible to take features from an outside source and add them in to an existing program, or in the extreme to write a dynamic programming system which only took external features.

GFF allows people to develop features and have them tested without having to maintain a complete feature-finding system. Equally, it would help those developing and applying integrated gene-finding programs to test new feature detectors developed by others, or even by themselves.

We want the GFF format to be easy to parse and process by a variety of programs in different languages. e.g. it would be useful if Unix tools like grep, sort and simple perl and awk scripts could easily extract information out of the file. For these reasons, for the primary format, we propose a record-based structure, where each feature is described on a single line, and line order is not relevant.

We do not intend GFF format to be used for complete data management of the analysis and annotation of genomic sequence. Systems such as Acedb, Genotator etc. that have much richer data representation semantics have been designed for that purpose. The disadvantages in using their formats for data exchange (or other richer formats such as ASN.1) are (1) they require more complexity in parsing/processing, (2) there is little hope on achieving consensus on how to capture all information. GFF is intentionally aiming for a low common denominator.

With the changes taking place to version 2 of the format, we also allow for feature sets to be defined over RNA and Protein sequences, as well as genomic DNA. This is used for example by the EMBOSS project to provide standard format output for all features as an option. In this case the <strand> and <frame> fields should be set to '.'. To assist this transition in specification, a new #Type Meta-Comment has been added.

Here are some example records:

SEQ1  EMBL  atg  103  105  .  +  0
SEQ1  EMBL  exon  103  172  .  +  0
SEQ1  EMBL  splice5  172  173  .  +  .
SEQ1  netgene  splice5  172  173  0.94  +  .
SEQ1  genie  sp5-20  163  182  2.3  +  .
SEQ1  genie  sp5-10  168  177  2.1  +  .
SEQ2  grail  ATG  17  19  2.1  -  0